Introduction

Document classification is the task of assigning a document to one or more categories or classes. Automating this process is of great importance for modern applications, and a variety of methods have therefore been developed over the years.

The methods for automatic classification can be informally divided into two groups: statistical algorithms and structural algorithms. Examples of statistical algorithms are Regression and Naïve Bayes. Structural algorithms can be further divided into Rule Based (Decision Trees, Production Rules), Distance Based (kNN, Centroid) and Neural Networks (Marmanis, Babenko, 2009).

Single-label classification is concerned with learning from a set of documents, each of which is associated with a single label (class) l from a set of labels L. In multi-label classification, each document can be associated with more than one label from L. In the single-label case, if L contains exactly two labels, the learning problem is called binary classification; if L contains more than two labels, it is called multi-class classification (Tsoumakas, Katakis, 2007).
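The distinctions above can be made concrete with a short sketch. It assumes each document's labels are given as a Python set; the function name problem_type and the sample labels are hypothetical and serve only as illustration.

```python
def problem_type(label_sets):
    """Name the learning problem implied by a list of per-document label sets."""
    all_labels = set().union(*label_sets)
    if len(all_labels) < 2:
        raise ValueError("at least two distinct labels are required")
    # Any document carrying more than one label makes the task multi-label.
    if any(len(labels) > 1 for labels in label_sets):
        return "multi-label"
    # Otherwise the task is single-label; the size of L decides the name.
    return "binary" if len(all_labels) == 2 else "multi-class"


spam_filter = [{"spam"}, {"ham"}, {"spam"}]
news_topics = [{"sports"}, {"politics"}, {"science"}]
tagged_posts = [{"sports", "politics"}, {"science"}]
```

Here spam_filter describes a binary problem, news_topics a multi-class problem, and tagged_posts a multi-label problem.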

Categorization tasks

A categorization task consists of two basic phases: training and classification. The training phase processes a set of labeled documents to create a model. The classification phase uses that model to assign one or more labels to unlabeled documents.
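The two phases can be sketched as a pair of functions. The example below uses a small multinomial Naïve Bayes classifier with add-one smoothing purely as an illustration; it is not necessarily the algorithm used by the categorization module, and the function names, model layout and sample documents are assumptions.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Training phase: build a model from (tokens, label) pairs."""
    class_doc_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {
        "class_doc_counts": class_doc_counts,
        "token_counts": token_counts,
        "vocabulary": vocabulary,
    }


def classify(model, tokens):
    """Classification phase: return the most probable label for a token list."""
    total_docs = sum(model["class_doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, doc_count in model["class_doc_counts"].items():
        counts = model["token_counts"][label]
        total_tokens = sum(counts.values())
        # Start from the log prior of the class.
        score = math.log(doc_count / total_docs)
        for token in tokens:
            # Add-one smoothing keeps unseen tokens from zeroing the score.
            score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: each document is already reduced to tokens.
training_set = [
    (["goal", "match", "team"], "sports"),
    (["team", "coach", "season"], "sports"),
    (["election", "vote", "party"], "politics"),
    (["party", "minister", "vote"], "politics"),
]
model = train(training_set)
predicted = classify(model, ["vote", "election", "minister"])
```

The model is built once from the labeled set and can then be reused for any number of unlabeled documents, which is why the two phases are kept as separate functions.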

During both phases, each document is represented as a set of features, which are then used to build the models for the individual classes. Depending on the classification algorithm, a feature reduction method may be applied during processing. The Language Processing Framework (LPC) (see chapter “Language processing chains”) processes each document and provides access to different types of features: tokens, lemmas, noun phrases, and head tokens. The same algorithm can therefore be configured to work with different feature types. Moreover, the categorization module can host several algorithms simultaneously; the results of the individual classifiers are combined, and the final classification is determined by majority voting.
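A minimal sketch of the majority-voting step follows. It assumes each classifier is a callable bound to one feature type and returning a single label; the rule-based stand-in classifiers, the document layout and the tie-breaking behaviour are all hypothetical, since the text does not specify them.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most classifiers."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    # most_common orders equal counts by first occurrence, so a tie goes to
    # the earliest classifier; this tie-break is an assumption of the sketch.
    return Counter(predictions).most_common(1)[0][0]


def classify_with_ensemble(classifiers, document):
    """Run every (feature_type, classifier) pair and combine the votes."""
    votes = [clf(document[feature_type]) for feature_type, clf in classifiers]
    return majority_vote(votes)


# Hypothetical stand-ins for trained classifiers, one per feature type.
def token_rule(tokens):
    return "sports" if "goal" in tokens else "politics"


def lemma_rule(lemmas):
    return "sports" if "play" in lemmas else "politics"


def noun_phrase_rule(noun_phrases):
    return "politics" if "prime minister" in noun_phrases else "sports"


# A document as it might look after language processing (assumed layout).
document = {
    "tokens": ["the", "team", "scored", "a", "goal"],
    "lemmas": ["the", "team", "score", "a", "goal"],
    "noun_phrases": ["the team", "a goal"],
}
classifiers = [
    ("tokens", token_rule),
    ("lemmas", lemma_rule),
    ("noun_phrases", noun_phrase_rule),
]
result = classify_with_ensemble(classifiers, document)
```

Pairing each classifier with a feature type mirrors the idea that one algorithm can be set up for different features, while the vote itself stays independent of how any individual prediction was produced.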